529 Rec'd PCT/PTO 02 NOV 2000

09/674583




DESCRIPTION 



DEVICE AND METHOD FOR PATTERN RECOGNITION, AND PROVIDING MEDIUM 
Technical Field 

The present invention relates to a device and a method for pattern
recognition, and to a providing medium. More particularly, the
invention relates to a device and a method for pattern recognition
for recognizing words spoken in a noisy environment, and to a
providing medium.
Background Art 

Heretofore, methods for discriminating words spoken in noisy
environments have been devised. Typical known methods include the PMC
(Parallel Model Combination) method, the SS/NSS (Spectral
Subtraction/Nonlinear Spectral Subtraction) method, and the SFE
(Stochastic Feature Extraction) method.

In any of the above-mentioned methods, a feature quantity is extracted
from voice data in which a spoken voice is mixed with environment
noise, it is judged which of the acoustic models corresponding to
plural previously registered words best matches the feature quantity,
and the word corresponding to the best-matching acoustic model is
output as the result of the recognition.

The features of the above-mentioned methods are described below.
In the PMC method, the correct recognition rate is excellent because
environment noise information is directly incorporated in the acoustic
model, but the calculation cost is high (high-level computation is
required, so the scale of the device becomes large and the processing
takes longer). In the SS/NSS method, the environment noise is
eliminated at the stage of extracting the feature quantity of the
voice data. Hence, the calculation cost is lower than that of the PMC
method, and this method is used in many cases at present. In the
SS/NSS method, the feature quantity of the voice data is extracted as
a vector. In the SFE method, the environment noise is eliminated at
the stage of extracting the feature quantity of the mixed signal, in
the same way as in the SS/NSS method; however, the feature quantity is
extracted as a probability distribution.

In the SFE method, however, the environment noise is not directly
reflected at the speech recognition stage; that is, the information
of the environment noise is not directly incorporated in the silence
acoustic model. There has therefore been a problem that the correct
recognition rate is insufficient.

In addition, because the information of the environment noise is
not directly incorporated in the silence acoustic model, there has
also been a problem that the correct recognition rate lowers as the
time from the point at which speech recognition is started until the
point at which the speech is started becomes longer.

Disclosure of the Invention

In view of such circumstances, the present invention aims to
correct the silence acoustic model by the use of the information of
the environment noise, and thereby to prevent the correct recognition
rate from lowering as the time from the point at which speech
recognition is started until the point at which the speech is started
becomes longer.

To solve such problems, the present invention provides a pattern
recognizing device that comprises extracting means for extracting the
pattern of input data as a feature distribution, storing means for
storing a stated number of models, classifying means for classifying
the feature distribution extracted by the extracting means into any of
the stated number of models, and generating means for generating a
model that corresponds to a state in which the data do not exist, on
the basis of the noise that has been input just before the inputting
of the data, and for updating the corresponding model stored in the
storing means.

Besides, the present invention provides a pattern recognizing
method that comprises an extracting step of extracting the pattern of
input data as a feature distribution, a storing step of storing a
stated number of models, a classifying step of classifying the
feature distribution extracted at the extracting step into any of the
stated number of models, and a generating step of generating a model
that corresponds to a state in which the data do not exist, on the
basis of the noise that has been input just before the inputting of
the data, and of updating the corresponding model stored at the
storing step.

In addition, the present invention provides a providing medium for
providing a computer-readable program that causes a pattern
recognizing device to execute processing that includes an extracting
step of extracting the pattern of input data as a feature
distribution, a storing step of storing a stated number of models, a
classifying step of classifying the feature distribution extracted at
the extracting step into any of the stated number of models, and a
generating step of generating a model that corresponds to a state in
which the data do not exist, on the basis of the noise that has been
input just before the inputting of the data, and of updating the
corresponding model stored at the storing step.

As a result, according to such a pattern recognizing device,
pattern recognizing method, and providing medium, the pattern of the
input data is extracted as a feature distribution, a stated number of
models are stored, and the extracted feature distribution is
classified into any of the stated number of models. Besides, a model
that corresponds to a state in which no data exist is generated on the
basis of the noise that has been input just before the inputting of
the data, and the corresponding stored model is updated. In this way,
it becomes possible to prevent the correct recognition rate from
lowering as the time from the point at which speech recognition is
started until the point at which the speech is started becomes longer.

Brief Description of the Drawings 

Fig. 1 is a block diagram showing an example of the configuration 
of a speech recognition device to which the present invention has been 
applied. 

Fig. 2 is a diagram used for explaining the operation of the 
noise-observation-section extracting division of Fig. 1. 

Fig. 3 is a block diagram showing an example of the detailed 
configuration of the feature extracting division of Fig. 1. 

Fig. 4 is a block diagram showing an example of the detailed 
configuration of the speech recognizing division of Fig. 1. 

Fig. 5 is a diagram used for explaining the operation of the speech
recognizing division.

Fig. 6 is a diagram used for explaining the operation of the 
silence-acoustic-model correcting division of Fig. 1. 

Fig. 7 is a diagram used for explaining the operation of the 
silence-acoustic-model correcting division of Fig. 1. 

Fig. 8 is a diagram showing the experimental results of speech 
recognition of the speech recognition device to which the present 
invention has been applied. 

Best Mode for Carrying Out the Invention 

An example of the configuration of the speech recognizing device
to which the present invention has been applied will be explained with
reference to Fig. 1. In this speech recognizing device, a microphone 1
gathers the spoken voice that is the object of recognition together
with the environment noise, and outputs it to a framing division 2.
The framing division 2 takes the voice data input from the microphone
1 at a stated time interval (for instance, 10 ms), and outputs the
taken data as the data of 1 frame. The voice data of 1 frame unit
output by the framing division 2 is supplied to a
noise-observation-section extracting division 3 and a feature
extracting division 5 as an observation vector a, whose components are
the respective time-series voice data that compose the frame.

Hereinafter, the observation vector that is the voice data of the
t-th frame is designated as a(t), for convenience.
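
The framing step can be illustrated with a minimal sketch. The function name `split_into_frames`, the 16 kHz sampling rate, and the choice of non-overlapping frames are assumptions made for illustration only; the description fixes only the frame interval (for instance, 10 ms).

```python
from typing import List, Sequence


def split_into_frames(samples: Sequence[float], frame_len: int) -> List[List[float]]:
    """Cut a sample stream into consecutive, non-overlapping frames.

    Each returned frame plays the role of one observation vector a(t).
    Trailing samples that do not fill a whole frame are dropped (an
    assumption of this sketch; the text does not say how they are handled).
    """
    if frame_len <= 0:
        raise ValueError("frame_len must be positive")
    n_frames = len(samples) // frame_len
    return [list(samples[t * frame_len:(t + 1) * frame_len]) for t in range(n_frames)]


# 10 ms frames at a hypothetical 16 kHz sampling rate give 160 samples per frame.
SAMPLE_RATE_HZ = 16_000
FRAME_LEN = SAMPLE_RATE_HZ * 10 // 1000
```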

The noise-observation-section extracting division 3 buffers the
framed voice data input from the framing division 2 for a stated time
(a duration corresponding to M frames or more). As shown in Fig. 2,
the section from the instant t_a, which precedes by M frames the
instant t_b at which a press-to-talk switch 4 is turned to the ON
position, until the instant t_b is referred to as the noise
observation section Tn. The noise-observation-section extracting
division 3 extracts the observation vectors a of the M frames in the
noise observation section Tn, and outputs them to the feature
extracting division 5 and to a silence-acoustic-model correcting
division 7.

The press-to-talk switch 4 is turned to its ON position by the user
at the time when the user intends to start speaking. Therefore, the
spoken voice is not included in the voice data of the noise
observation section Tn, which precedes the instant t_b at which the
press-to-talk switch 4 is turned to the ON position, and only the
environment noise exists. Besides, the section from the instant t_b at
which the press-to-talk switch 4 is turned to the ON position until
the instant t_d at which the press-to-talk switch 4 is turned to the
OFF position is referred to as the speech recognition section, and the
voice data of that section is treated as the object of speech
recognition.
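
One possible way to keep the last M frames available at the instant the press-to-talk switch is turned ON is a fixed-length buffer, sketched below. The class name `NoiseObservationBuffer` and its methods are hypothetical; the description specifies only that at least M frames are buffered and that the M frames preceding the switch-on instant form the noise observation section Tn.

```python
from collections import deque
from typing import Deque, List, Sequence


class NoiseObservationBuffer:
    """Keeps the most recent M frames so they can be read out at switch-on."""

    def __init__(self, m_frames: int) -> None:
        self._frames: Deque[List[float]] = deque(maxlen=m_frames)

    def push(self, frame: Sequence[float]) -> None:
        # Older frames fall off automatically once M frames are buffered.
        self._frames.append(list(frame))

    def noise_section(self) -> List[List[float]]:
        # Intended to be called when the switch turns ON: the buffered
        # frames are then the M frames that precede that instant.
        return list(self._frames)
```

In use, every frame from the framing step would be pushed into the buffer, and `noise_section()` would be read once when the switch is turned ON.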

On the basis of the voice data that is input from the
noise-observation-section extracting division 3 and includes only the
environment noise of the noise observation section Tn, the feature
extracting division 5 eliminates the environment noise components from
the observation vector a of the speech recognition section following
the instant t_b, which is input from the framing division 2, and then
extracts its feature quantity. In other words, the feature extracting
division 5 performs, for instance, Fourier transformation on the true
voice data (from which the environment noise has been eliminated)
treated as the observation vector a, obtains its power spectrum, and
then calculates a feature vector y whose components are the respective
frequency components of the power spectrum. The method of calculating
the power spectrum is not limited to one based on Fourier
transformation; the power spectrum can also be obtained by other
methods, such as the so-called filter-bank method.

In addition, on the basis of the calculated feature vector y, the
feature extracting division 5 calculates a parameter Z that represents
the distribution on the feature vector space obtained when the voice
included in the voice data of the observation vector a is mapped to
the space of the feature quantity (the feature quantity space)
(hereinafter, this parameter is referred to as a feature distribution
parameter), and then supplies the parameter Z to a speech recognizing
division 6.

Fig. 3 shows an example of the detailed configuration of the
feature extracting division 5 of Fig. 1. The observation vector a
input from the framing division 2 is supplied to a power-spectrum
analyzing division 11 in the feature extracting division 5. In the
power-spectrum analyzing division 11, for instance, Fourier
transformation is performed on the observation vector a by the FFT
(fast Fourier transform) algorithm; in this way, the power spectrum
that is the feature quantity of the speech is extracted as the feature
vector. In this preferred embodiment, the observation vector a, which
is the voice data of 1 frame, is converted into a feature vector
composed of D components (a D-dimensional feature vector).
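
As a rough illustration of how one frame becomes a D-dimensional power-spectrum feature vector, the sketch below uses a naive discrete Fourier transform in place of an FFT (the result is the same, only slower). The function name `power_spectrum` and the choice of keeping bins 0 through N/2, so that D = N/2 + 1, are assumptions of this sketch and are not stated in the description.

```python
import cmath
import math
from typing import List, Sequence


def power_spectrum(frame: Sequence[float]) -> List[float]:
    """Return |DFT|^2 for bins 0..N/2 of one frame (naive O(N^2) DFT)."""
    n = len(frame)
    spectrum = []
    for k in range(n // 2 + 1):
        acc = 0j
        for idx, sample in enumerate(frame):
            acc += sample * cmath.exp(-2j * math.pi * k * idx / n)
        # Power is the squared magnitude of the complex DFT coefficient.
        spectrum.append(abs(acc) ** 2)
    return spectrum
```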

Here, the feature vector obtained from the observation vector a(t)
of the t-th frame is designated as y(t). Besides, of the feature
vector y(t), the spectrum component of true speech is designated as
x(t), and the spectrum component of environment noise is designated as
u(t). In this case, the spectrum component x(t) of true speech is
represented by the following equation, eq. 1.

$$x(t) = y(t) - u(t) \tag{1}$$

Here, it is assumed that the environment noise has an irregular
characteristic, and that the observation vector a(t) is voice data in
which the environment noise has been added to the true speech
component.

On the other hand, the voice data (environment noise) input from
the noise-observation-section extracting division 3 is input to a
noise-characteristic calculating division 13 in the feature extracting
division 5. In the noise-characteristic calculating division 13, the
characteristics of the environment noise in the noise observation
section Tn are obtained.

That is, the mean value (mean vector) and the variance (variance
matrix) of the environment noise are obtained in the
noise-characteristic calculating division 13, on the assumptions that
the distribution of the power spectrum u(t) of the environment noise
in the speech recognition section is identical to that of the
environment noise in the noise observation section Tn just preceding
the speech recognition section, and that the distribution is a normal
distribution.

The mean vector μ' and the variance matrix Σ' can be obtained on
the basis of the following equations, eq. 2 and eq. 3.

$$
\begin{aligned}
\mu'(i) &= \frac{1}{M}\sum_{t=1}^{M} y(t)(i) \\
\Sigma'(i,j) &= \frac{1}{M}\sum_{t=1}^{M}\bigl(y(t)(i)-\mu'(i)\bigr)\bigl(y(t)(j)-\mu'(j)\bigr)
\end{aligned}
\tag{2}
$$

Where μ'(i) represents the i-th component of the mean vector μ'
(i = 1, 2, ..., D). Besides, y(t)(i) represents the i-th component of
the t-th frame's feature vector. In addition, Σ'(i, j) represents the
component in the i-th row and the j-th column of the variance matrix
Σ' (j = 1, 2, ..., D).

Here, in order to reduce the quantity of calculation, it is
assumed that, with respect to the environment noise, the respective
components of the feature vector y are uncorrelated with each other.
In this case, as shown in the following equation, the components of
the variance matrix Σ' other than the diagonal elements become 0.

$$\Sigma'(i,j) = 0, \quad i \neq j \tag{3}$$

In this way, the mean vector μ' and the variance matrix Σ', which
are the characteristics of the environment noise, are obtained in the
noise-characteristic calculating division 13, and then supplied to a
feature-distribution-parameter calculating division 12.
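
The noise statistics of eq. 2, restricted to the diagonal elements in line with eq. 3, could be computed along the following lines. The function name `noise_statistics` and the list-of-lists input layout (M frames, each with D components) are assumptions made for this sketch.

```python
from typing import List, Sequence, Tuple


def noise_statistics(noise_features: Sequence[Sequence[float]]) -> Tuple[List[float], List[float]]:
    """Return (mean, diagonal variance) per component over the M noise frames."""
    m = len(noise_features)
    if m == 0:
        raise ValueError("need at least one noise frame")
    d = len(noise_features[0])
    # mu'(i): average of component i over the M frames (eq. 2, first line).
    mean = [sum(y[i] for y in noise_features) / m for i in range(d)]
    # Only the diagonal Sigma'(i, i) is kept; off-diagonal terms are taken as 0 (eq. 3).
    var = [sum((y[i] - mean[i]) ** 2 for y in noise_features) / m for i in range(d)]
    return mean, var
```

Storing only the diagonal reduces the cost from D x D entries to D entries per noise observation section, which is the stated purpose of the uncorrelatedness assumption.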

On the other hand, the output of the power-spectrum analyzing
division 11, that is, the feature vector y of the spoken voice that
includes the environment noise, is supplied to the
feature-distribution-parameter calculating division 12. In the
feature-distribution-parameter calculating division 12, the feature
distribution parameter that represents the distribution of the power
spectrum of true speech (the distribution of the estimated value) is
calculated on the basis of the feature vector y given from the
power-spectrum analyzing division 11 and the characteristics of the
environment noise given from the noise-characteristic calculating
division 13.

That is, in the feature-distribution-parameter calculating
division 12, on the presumption that the distribution of the power
spectrum of true speech is a normal distribution, its mean vector ξ
and variance matrix Ψ are calculated as the feature distribution
parameter, in accordance with the following equations, eq. 4 through
eq. 7.



$$
\begin{aligned}
\xi(t)(i) &= E[x(t)(i)] \\
&= E[y(t)(i) - u(t)(i)] \\
&= \int_0^{y(t)(i)} \bigl(y(t)(i) - u(t)(i)\bigr)\,
   \frac{P(u(t)(i))}{\int_0^{y(t)(i)} P(u(t)(i))\,du(t)(i)}\,du(t)(i) \\
&= \frac{y(t)(i)\int_0^{y(t)(i)} P(u(t)(i))\,du(t)(i)
   - \int_0^{y(t)(i)} u(t)(i)\,P(u(t)(i))\,du(t)(i)}
   {\int_0^{y(t)(i)} P(u(t)(i))\,du(t)(i)} \\
&= y(t)(i) - \frac{\int_0^{y(t)(i)} u(t)(i)\,P(u(t)(i))\,du(t)(i)}
   {\int_0^{y(t)(i)} P(u(t)(i))\,du(t)(i)}
\end{aligned}
\tag{4}
$$

$$
\begin{aligned}
\text{If } i = j:\quad \Psi(t)(i,j) &= V[x(t)(i)] \\
&= E[(x(t)(i))^2] - (E[x(t)(i)])^2 \\
&\bigl(= E[(x(t)(i))^2] - (\xi(t)(i))^2\bigr) \\
\text{If } i \neq j:\quad \Psi(t)(i,j) &= 0
\end{aligned}
\tag{5}
$$

$$
\begin{aligned}
E[(x(t)(i))^2] &= E[(y(t)(i) - u(t)(i))^2] \\
&= \int_0^{y(t)(i)} \bigl(y(t)(i) - u(t)(i)\bigr)^2\,
   \frac{P(u(t)(i))}{\int_0^{y(t)(i)} P(u(t)(i))\,du(t)(i)}\,du(t)(i) \\
&= \frac{1}{\int_0^{y(t)(i)} P(u(t)(i))\,du(t)(i)}
   \Bigl\{ (y(t)(i))^2 \int_0^{y(t)(i)} P(u(t)(i))\,du(t)(i) \\
&\qquad - 2\,y(t)(i) \int_0^{y(t)(i)} u(t)(i)\,P(u(t)(i))\,du(t)(i)
   + \int_0^{y(t)(i)} (u(t)(i))^2\,P(u(t)(i))\,du(t)(i) \Bigr\} \\
&= (y(t)(i))^2
   - 2\,y(t)(i)\,\frac{\int_0^{y(t)(i)} u(t)(i)\,P(u(t)(i))\,du(t)(i)}
     {\int_0^{y(t)(i)} P(u(t)(i))\,du(t)(i)}
   + \frac{\int_0^{y(t)(i)} (u(t)(i))^2\,P(u(t)(i))\,du(t)(i)}
     {\int_0^{y(t)(i)} P(u(t)(i))\,du(t)(i)}
\end{aligned}
\tag{6}
$$

Where ξ(t)(i) represents the i-th component of the mean vector
ξ(t) in the t-th frame, E[ ] represents the mean value of what is
within [ ], and x(t)(i) represents the i-th component of the power
spectrum x(t) of true speech in the t-th frame. u(t)(i) represents the
i-th component of the power spectrum of the environment noise in the
t-th frame, and P(u(t)(i)) represents the probability that the i-th
component of the power spectrum of the environment noise in the t-th
frame is u(t)(i). Here, the distribution of the environment noise is
assumed to be a normal distribution, and so P(u(t)(i)) is represented
as shown in eq. 7.

$$
P(u(t)(i)) = \frac{1}{\sqrt{2\pi\,\Sigma'(i,i)}}
\exp\!\left(-\frac{\bigl(u(t)(i)-\mu'(i)\bigr)^2}{2\,\Sigma'(i,i)}\right)
\tag{7}
$$

And Ψ(t)(i, j) represents the component in the i-th row and the
j-th column of the variance Ψ(t) in the t-th frame. V[ ] represents
the variance of what is within [ ].

In this way, in the feature-distribution-parameter calculating
division 12, the mean vector ξ and the variance matrix Ψ are obtained,
for each frame, as the feature distribution parameter that represents
the distribution on the feature vector space of true speech (here, the
distribution under the assumption that the distribution on the feature
vector space of true speech is a normal distribution).

After that, the feature distribution parameter obtained in each
frame of the speech recognition section is output to the speech
recognizing division 6. Now, supposing that the speech recognition
section is composed of T frames, and the feature distribution
parameters obtained in the respective T frames are designated as
z(t) = {ξ(t), Ψ(t)} (t = 1, 2, ..., T), the
feature-distribution-parameter calculating division 12 supplies the
feature distribution parameter (sequence) Z = {z(1), z(2), ..., z(T)}
to the speech recognizing division 6.
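
For a single spectral component, the mean of eq. 4 and the variance of eq. 5 and eq. 6 are the moments of y minus a noise value drawn from a normal distribution restricted to the interval from 0 to y. The sketch below evaluates them with the standard closed-form moments of a truncated normal distribution rather than by numerical integration; that substitution, the function name `feature_distribution`, and the handling of degenerate inputs are assumptions of this sketch and are not taken from the description.

```python
import math
from typing import Tuple


def _pdf(z: float) -> float:
    """Standard normal density."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)


def _cdf(z: float) -> float:
    """Standard normal distribution function (erfc keeps the lower tail accurate)."""
    return 0.5 * math.erfc(-z / math.sqrt(2.0))


def feature_distribution(y: float, noise_mean: float, noise_var: float) -> Tuple[float, float]:
    """Return (xi, psi) for one component, given y(t)(i), mu'(i), Sigma'(i, i)."""
    if y <= 0.0 or noise_var <= 0.0:
        return 0.0, 0.0  # degenerate interval; returning zeros is an assumption
    sigma = math.sqrt(noise_var)
    lo = (0.0 - noise_mean) / sigma   # standardized lower limit u = 0
    hi = (y - noise_mean) / sigma     # standardized upper limit u = y
    mass = _cdf(hi) - _cdf(lo)        # denominator integral of P(u) over [0, y]
    if mass <= 0.0:
        return 0.0, 0.0  # no probability mass left numerically; also an assumption
    shift = (_pdf(lo) - _pdf(hi)) / mass
    noise_expect = noise_mean + sigma * shift  # E[u | 0 <= u <= y]
    noise_spread = noise_var * (1.0 + (lo * _pdf(lo) - hi * _pdf(hi)) / mass - shift * shift)
    xi = y - noise_expect             # eq. 4
    psi = max(noise_spread, 0.0)      # eq. 5: V[y - u] equals V[u] on [0, y]
    return xi, psi
```

When y is far above the noise level, the truncation has almost no effect, so ξ approaches y − μ' and Ψ approaches Σ'(i, i); when y is close to or below the noise mean, ξ stays between 0 and y.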

Referring to Fig. 1 again, the speech recognizing division 6
classifies the feature distribution parameter Z input from the feature
extracting division 5 into any of a stated number K of acoustic models
and one silence acoustic model, and then outputs the result of the
classification as the result of the recognition of the input voice. In
other words, the speech recognizing division 6 stores, for instance, a
discriminant function corresponding to the silent section (a function
for discriminating whether the feature distribution parameter Z should
be classified into the silence acoustic model or not) and discriminant
functions corresponding respectively to the stated number K of words
(functions for discriminating into which of the acoustic models the
feature distribution parameter Z should be classified), and calculates
the values of the discriminant functions of the respective acoustic
models, using the feature distribution parameter Z given from the
feature extracting division 5 as the argument. The acoustic model (a
word or the silent section) that has the maximum function value is
then output as the result of the recognition.

Fig. 4 shows an example of the detailed configuration of the speech
recognizing division 6 of Fig. 1. The feature distribution parameter
Z input from the feature-distribution-parameter calculating division
12 of the feature extracting division 5 is supplied to
discriminant-function computing divisions 21-1 through 21-K, and to a
discriminant-function computing division 21-s. The
discriminant-function computing division 21-k (k = 1, 2, ..., K)
stores a discriminant function G_k(Z) for discriminating the word
corresponding to the k-th of the K acoustic models, and computes the
discriminant function G_k(Z), using the feature distribution parameter
Z given from the feature extracting division 5 as the argument. The
discriminant-function computing division 21-s stores a discriminant
function G_s(Z) for discriminating the silent section corresponding to
the silence acoustic model, and computes the discriminant function
G_s(Z), using the feature distribution parameter Z given from the
feature extracting division 5 as the argument.

In the speech recognizing division 6, discrimination (recognition)
of the words or the silent section, which are the classes, is
performed by the use of, for instance, the HMM (Hidden Markov Model)
method.

Now, the HMM method is explained with reference to Fig. 5. As shown
in the figure, an HMM has H states q_1 through q_H; as to state
transitions, only a transition to the state itself and a transition to
the next state on the right side are allowed. The initial state is
defined as the leftmost state q_1, the final state is defined as the
rightmost state q_H, and state transition from the final state q_H is
inhibited. A model that does not include any transition to the left
side is referred to as a left-to-right model; in speech recognition, a
left-to-right model is generally used.

Now, the model for discriminating the k-th class of the HMM is
hereinafter referred to as the k-class model. The k-class model is
prescribed by, for instance, the probability of initially staying in
the state q_h (the initial state probability) π_k(q_h), the
probability of staying in the state q_i at a certain time (frame) t
and transferring to the state q_j at the next time t+1 (the transition
probability) a_k(q_i, q_j), and the probability that the state q_i
outputs the feature vector O at the time when a state transition
occurs from the state q_i (the output probability) b_k(q_i)(O)
(h = 1, 2, ..., H).

In the case where a certain feature vector sequence O_1, O_2, ...
has been given, for instance, the class of the model that has the
highest probability of observing such a feature vector sequence (the
observation probability) is determined to be the result of the
recognition of the feature vector sequence.

Here, this observation probability is obtained by the discriminant
function G_k(Z). That is, the discriminant function G_k(Z) is given by
the following equation, eq. 8, as the probability that the feature
distribution parameter (sequence) Z = {z_1, z_2, ..., z_T} is observed
along the optimal state sequence (the optimal way of transferring
between states) for that feature distribution parameter (sequence).

$$
G_k(Z) = \max_{q_1, q_2, \ldots, q_T}
\pi_k(q_1)\, b'_k(q_1)(z_1)\, a_k(q_1, q_2)\, b'_k(q_2)(z_2)
\cdots a_k(q_{T-1}, q_T)\, b'_k(q_T)(z_T)
\tag{8}
$$



Where b'_k(q_i)(z_j) represents the output probability at the time
when the output is the distribution represented by z_j. Here, for the
output probability b_k(s)(O_t), which is the probability of outputting
each feature vector at the time of state transition, a normal
distribution function is used on the assumption that the components on
the feature vector space are uncorrelated with each other. In this
case, when the input is a distribution represented by z_t, the output
probability b'_k(s)(z_t) can be found with the following equation,
eq. 9, using the probability density function P_k^m(s)(x) prescribed
by the mean vector μ_k(s) and the variance matrix Σ_k(s), as well as
the probability density function P^f(t)(x) that represents the
distribution of the t-th frame's feature vector (here, the power
spectrum) x.



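$$
\begin{aligned}
b'_k(s)(z_t) &= \int P^f(t)(x)\, P_k^m(s)(x)\, dx \\
&= \prod_{i=1}^{D} P(s)(i)\bigl(\xi(t)(i), \Psi(t)(i,i)\bigr)
\end{aligned}
\qquad k = 1, 2, \ldots, K;\; s = q_1, q_2, \ldots, q_T;\; t = 1, 2, \ldots, T
\tag{9}
$$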



Where the integration in equation 9 is taken over the whole of the
D-dimensional feature vector space (as [...]

(10)

[...] the mean vector [...]

[...] given from the respective ones of the discriminant-function
computing divisions 21-1 through 21-K and the discriminant-function
computing division 21-s, the class (the acoustic model) to which the
feature distribution parameter Z, that is, the input voice, belongs is
discriminated by the use of, for instance, the decision rule shown in
the following equation, eq. 11.

$$
C(Z) = C_k, \quad \text{if } G_k(Z) = \max_i \{G_i(Z)\}
\tag{11}
$$

Where C(Z) represents the function that performs the
discrimination operation (processing) of discriminating the class to
which the feature distribution parameter Z belongs. Besides, max on
the right side of the second expression of equation 11 denotes the
maximum value of the function values G_i(Z) that follow it (here,
i = s, 1, 2, ..., K).

The deciding division 22 decides the class in accordance with
equation 11, and then outputs it as the recognition result of the
input voice.
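
The evaluation of the discriminant functions of eq. 8 and the decision rule of eq. 11 can be sketched as a best-path search over a left-to-right model, carried out in the log domain. Several points are assumptions of this sketch and are not taken from the description: the output probability for an input distribution is written as a per-component normal density whose variance is the sum of the state variance and the input variance (the standard result of integrating the product of two normal densities, as in eq. 9 under the diagonal assumption); the path is required to start in q_1 with probability 1 and to end in q_H; and the names `log_output_prob`, `log_discriminant` and `decide` are hypothetical.

```python
import math
from typing import Dict, List, Sequence, Tuple

Frame = Tuple[List[float], List[float]]  # z(t) = (xi(t), diagonal of psi(t))
State = Tuple[List[float], List[float]]  # (state mean, diagonal state variance)


def log_output_prob(state: State, z: Frame) -> float:
    """log b'(s)(z_t): per-component normal density with summed variances."""
    mean, var = state
    xi, psi = z
    total = 0.0
    for m, v, x, p in zip(mean, var, xi, psi):
        s = v + p
        total += -0.5 * (math.log(2.0 * math.pi * s) + (m - x) ** 2 / s)
    return total


def log_discriminant(states: Sequence[State], log_stay: Sequence[float],
                     log_next: Sequence[float], frames: Sequence[Frame]) -> float:
    """log G(Z): best left-to-right path that starts in q_1 and ends in q_H."""
    h = len(states)
    neg_inf = float("-inf")
    score = [neg_inf] * h
    score[0] = log_output_prob(states[0], frames[0])  # initial state is q_1
    for z in frames[1:]:
        new = [neg_inf] * h
        for j in range(h):
            best = score[j] + log_stay[j]  # transition to the state itself
            if j > 0:
                best = max(best, score[j - 1] + log_next[j - 1])  # move one state right
            if best > neg_inf:
                new[j] = best + log_output_prob(states[j], z)
        score = new
    return score[h - 1]


def decide(models: Dict[str, tuple], frames: Sequence[Frame]) -> str:
    """C(Z): the class whose discriminant value is largest."""
    return max(models, key=lambda name: log_discriminant(*models[name], frames))
```

Each entry of `models` is a tuple `(states, log_stay, log_next)`; the silence acoustic model is simply one more entry, so replacing its state parameters before recognition is enough to apply a corrected silence model.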

Returning to Fig. 1, on the basis of the voice data (environment
noise) of the noise observation section Tn input from the
noise-observation-section extracting division 3, the
silence-acoustic-model correcting division 7 generates the
discriminant function G_s(Z) corresponding to the silence acoustic
model stored in the speech recognizing division 6, and then supplies
it to the speech recognizing division 6.

Concretely, in the silence-acoustic-model correcting division 7,
the feature vector X is observed for each of the M frames of the voice
data (environment noise) of the noise observation section Tn input
from the noise-observation-section extracting division 3, and their
feature distributions are generated.

$$\{F_1(X), F_2(X), \ldots, F_M(X)\} \tag{12}$$

The feature distribution {F_i(X), i = 1, 2, ..., M} is a
probability density function, and so it is hereinafter also referred
to as the silence feature distribution PDF.

Next, as shown in Fig. 7, the silence feature distribution PDF is
mapped to the probability distribution F_s(X) corresponding to the
silence acoustic model, in accordance with the following equation,
eq. 13.

$$F_s(X) = V\bigl(F_1(X), F_2(X), \ldots, F_M(X)\bigr) \tag{13}$$

Where V is a correcting function (mapping function) for mapping
the silence feature distribution PDF {F_i(X), i = 1, 2, ..., M} to the
silence acoustic model F_s(X).

As to this mapping, a wide variety of methods are possible
depending on the description of the silence feature distribution PDF,
for instance,

\begin{align}
F_s(X) &= \sum_{i=1}^{M} \beta_i\bigl(F_1(X), F_2(X), \ldots, F_M(X), M\bigr) \cdot F_i(X) \tag{14} \\
&= \sum_{i=1}^{M} \beta_i \cdot F_i(X) \tag{15}
\end{align}

Where β_i(F_1(X), F_2(X), ..., F_M(X), M) is a weight function of
each silence feature distribution, and is designated as β_i
hereinafter. Besides, the weight function β_i should satisfy the
condition of the following equation, eq. 16.

$$\sum_{i=1}^{M} \beta_i\bigl(F_1(X), F_2(X), \ldots, F_M(X), M\bigr) = 1 \tag{16}$$

Assuming that the probability distribution F_s(X) of the silence
acoustic model is a normal distribution and that the components that
compose the feature vector of each frame are uncorrelated with each
other, the covariance matrix Σ_i of the silence feature distribution
PDF {F_i(X), i = 1, 2, ..., M} becomes a diagonal matrix. The
precondition for this assumption is that the covariance matrix of the
silence acoustic model is also a diagonal matrix. Therefore, if the
components that compose the feature vector of each frame are
uncorrelated, the silence feature distribution PDF {F_i(X), i = 1, 2,
..., M} becomes the normal distribution G(E_i, Σ_i) that has the mean
and the variance corresponding to each component. Where E_i is the
expected value of F_i(X), and Σ_i is the covariance matrix of F_i(X).

Besides, if the mean of the silence feature distribution
corresponding to each of the M frames of the noise observation section
Tn is designated as μ_i, and its variance is designated as σ_i^2, then
the probability density function of the silence feature distribution
can be designated as the normal distribution G(μ_i, σ_i^2) (i = 1, 2,
..., M). Therefore, the normal distribution G(μ_s, σ_s^2) (this
corresponds to the above-mentioned G_s(Z)) of the silence acoustic
model, computed in accordance with any of the following methods by the
use of the mean μ_i and the variance σ_i^2 corresponding to each
frame, becomes the approximate distribution of the silence acoustic
model F_s(X) shown in Fig. 7.

The first method for calculating the normal distribution
G(μ_s, σ_s^2) of the silence acoustic model is a method in which, by
the use of the silence feature distribution {G(μ_i, σ_i^2), i = 1, 2,
..., M}, the mean value μ_s of the silence acoustic model is obtained
from the mean of all μ_i, as shown in the following equation, eq. 17,
and the variance σ_s^2 of the silence acoustic model is obtained from
the mean of all σ_i^2, as shown in the following equation, eq. 18.

\begin{align}
\mu_s &= \frac{a}{M} \sum_{i=1}^{M} \mu_i \tag{17} \\
\sigma_s^2 &= \frac{b}{M} \sum_{i=1}^{M} \sigma_i^2 \tag{18}
\end{align}



Where a and b are coefficients whose optimal values are determined
by simulations.

The second method for calculating the normal distribution
G(μ_s, σ_s^2) of the silence acoustic model is a method in which the
mean value μ_s and the variance σ_s^2 of the silence acoustic model
are calculated in accordance with the following equations, eq. 19 and
eq. 20, by the use of only the expected value μ_i of the silence
feature distribution {G(μ_i, σ_i^2), i = 1, 2, ..., M}.

\begin{align}
\mu_s &= \frac{a}{M} \sum_{i=1}^{M} \mu_i \tag{19} \\
\sigma_s^2 &= b \cdot \left( \frac{1}{M} \sum_{i=1}^{M} \mu_i^2 - \mu_s^2 \right) \tag{20}
\end{align}

Where a and b are coefficients whose optimal values are determined
by simulations.
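
The first two methods can be sketched for a single feature component as follows. The function names and the default values a = b = 1 are assumptions of this sketch; in particular, the second-method variance is written here as the spread of the per-frame means around μ_s, which is this sketch's reading of eq. 20 based on the statement that only the expected values μ_i are used.

```python
from typing import Sequence, Tuple


def silence_model_method1(means: Sequence[float], variances: Sequence[float],
                          a: float = 1.0, b: float = 1.0) -> Tuple[float, float]:
    """First method: scaled averages of the per-frame means and variances."""
    m = len(means)
    mu_s = a * sum(means) / m          # eq. 17
    var_s = b * sum(variances) / m     # eq. 18
    return mu_s, var_s


def silence_model_method2(means: Sequence[float],
                          a: float = 1.0, b: float = 1.0) -> Tuple[float, float]:
    """Second method: uses only the per-frame means."""
    m = len(means)
    mu_s = a * sum(means) / m          # eq. 19
    # Spread of the means around mu_s (assumed reading of eq. 20).
    var_s = b * (sum(mu * mu for mu in means) / m - mu_s * mu_s)
    return mu_s, var_s
```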

The third method for calculating the normal distribution
$G(\mu_s, \sigma_s^2)$ of the silence acoustic model calculates the
mean value $\mu_s$ and the variance $\sigma_s^2$ of the silence
acoustic model on the basis of the combination of the silence feature
distribution $\{G(\mu_i, \sigma_i^2),\ i = 1, 2, \ldots, M\}$.

In this method, the probabilistic statistic of each silence feature
distribution $G(\mu_i, \sigma_i^2)$ is designated as $X_i$.

$$\{X_1, X_2, \ldots, X_M\} \tag{21}$$

When the probabilistic statistic of the normal distribution
$G(\mu_s, \sigma_s^2)$ of the silence acoustic model is designated as
$X_s$, the probabilistic statistic $X_s$ can be represented by the
linear combination of the probabilistic statistics $X_i$ and the
weight function $\beta_i$, as shown in eq. 22. In this connection,
the weight function $\beta_i$ satisfies the condition of eq. 16.

$$X_s = \sum_{i=1}^{M} \beta_i X_i \tag{22}$$

The normal distribution $G(\mu_s, \sigma_s^2)$ of the silence acoustic
model is then represented as shown in eq. 23.

$$G(\mu_s, \sigma_s^2) = G\left( \sum_{i=1}^{M} \beta_i \mu_i,\ \sum_{i=1}^{M} \beta_i^2 \sigma_i^2 \right) \tag{23}$$

To generalize eq. 23, the weight function $\beta_i$ is assumed to be
$\{\beta_i = 1/M,\ i = 1, 2, \ldots, M\}$, and the mean value $\mu_s$
and the variance $\sigma_s^2$ are multiplied by the respective
coefficients.

$$\mu_s = \frac{a}{M} \sum_{i=1}^{M} \mu_i \tag{24}$$

$$\sigma_s^2 = \frac{b}{M^2} \sum_{i=1}^{M} \sigma_i^2 \tag{25}$$

Here a and b are coefficients whose optimal values are determined by
simulations.
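The third method can be sketched as below. The sketch assumes that
the $X_i$ are independent, which is what makes the variance of the
linear combination equal to $\sum_i \beta_i^2 \sigma_i^2$; both
function names are hypothetical, and a single feature dimension is
used for brevity.

```python
def combine_normals(means, variances, weights):
    """Mean and variance of X_s = sum(beta_i * X_i) (eq. 22 and eq. 23).

    Assumes the X_i are independent normal variables.
    """
    mu = sum(w * m for w, m in zip(weights, means))
    var = sum(w * w * v for w, v in zip(weights, variances))
    return mu, var


def silence_model_method3(means, variances, a=1.0, b=1.0):
    """Hypothetical helper: eq. 24 and eq. 25 with beta_i = 1/M."""
    m = len(means)
    mu, var = combine_normals(means, variances, [1.0 / m] * m)
    return a * mu, b * var


mu_s, var_s = silence_model_method3([1.0, 2.0, 3.0], [0.5, 1.0, 1.5])
```

Because the variance shrinks with $1/M^2$ rather than $1/M$, this
method yields a much narrower distribution than the first method for
the same inputs, which may be why a different value of b is used for
it in the experiment described later.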



In the fourth method for calculating the normal distribution
$G(\mu_s, \sigma_s^2)$ of the silence acoustic model, a statistical
population $\Omega_i = \{f_{i,j}\}$ corresponding to the probabilistic
statistic $X_i$ of the silence feature distribution
$\{G(\mu_i, \sigma_i^2),\ i = 1, 2, \ldots, M\}$ is assumed.

If

$$\{N_i = N;\ i = 1, 2, \ldots, M\},$$

where $N_i$ is taken to be the number of elements of $\Omega_i$, the
mean value $\mu_i$ can be found by eq. 26, and the variance
$\sigma_i^2$ can be found by eqs. 27 and 28.

$$\mu_i = \frac{1}{N} \sum_{j=1}^{N} f_{i,j} \tag{26}$$

$$\sigma_i^2 = \frac{1}{N} \sum_{j=1}^{N} (f_{i,j} - \mu_i)^2 \tag{27}$$

$$\sigma_i^2 = \frac{1}{N} \sum_{j=1}^{N} f_{i,j}^2 - \mu_i^2 \tag{28}$$

Rearranging eq. 28 yields eq. 29.

$$\frac{1}{N} \sum_{j=1}^{N} f_{i,j}^2 = \sigma_i^2 + \mu_i^2 \tag{29}$$

Regarding the sum of the statistical populations,

$$\Omega = \bigcup_{i=1}^{M} \Omega_i,$$

eqs. 30 and 31 follow from eq. 26, and eqs. 32 through 34 follow from
eq. 29.

$$\mu_s = \frac{1}{MN} \sum_{i=1}^{M} \sum_{j=1}^{N} f_{i,j} \tag{30}$$

$$\mu_s = \frac{1}{M} \sum_{i=1}^{M} \mu_i \tag{31}$$

$$\sigma_s^2 = \frac{1}{MN} \sum_{i=1}^{M} \sum_{j=1}^{N} (f_{i,j} - \mu_s)^2 \tag{32}$$

$$\sigma_s^2 = \frac{1}{MN} \sum_{i=1}^{M} \sum_{j=1}^{N} f_{i,j}^2 - \mu_s^2 \tag{33}$$

$$\sigma_s^2 = \frac{1}{M} \sum_{i=1}^{M} (\sigma_i^2 + \mu_i^2) - \mu_s^2 \tag{34}$$

In practice, eq. 31 and eq. 34 are multiplied by the respective
coefficients before use.

$$\mu_s = \frac{a}{M} \sum_{i=1}^{M} \mu_i \tag{35}$$

$$\sigma_s^2 = b \cdot \left( \frac{1}{M} \sum_{i=1}^{M} (\sigma_i^2 + \mu_i^2) - \mu_s^2 \right) \tag{36}$$

Here a and b are coefficients whose optimal values are determined by
simulations.

Alternatively, as shown in eq. 37, the coefficient may be applied
only to the variance term.

$$\sigma_s^2 = \frac{b}{M} \sum_{i=1}^{M} \sigma_i^2 + \frac{1}{M} \sum_{i=1}^{M} \mu_i^2 - \mu_s^2 \tag{37}$$
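The fourth method amounts to pooling the underlying samples of all M
distributions and taking the mean and variance of the pooled set,
expressed through $\mu_i$ and $\sigma_i^2$ only. The sketch below
assumes that reading of eqs. 35 and 36 and checks it numerically
against directly pooled samples with a = b = 1. The function name and
the toy data are assumptions for illustration.

```python
from statistics import fmean, pvariance


def silence_model_method4(means, variances, a=1.0, b=0.1):
    """Hypothetical helper: eq. 35 and eq. 36 for one feature dimension."""
    m = len(means)
    mu_s = a * sum(means) / m                                   # eq. 35
    # 1/M * sum(sigma_i^2 + mu_i^2): second moment of the pooled population
    second_moment = sum(v + mu * mu for mu, v in zip(means, variances)) / m
    var_s = b * (second_moment - mu_s ** 2)                     # eq. 36
    return mu_s, var_s


# Toy populations of equal size N = 2 (made-up numbers).
populations = [[1.0, 3.0], [2.0, 6.0]]
means = [fmean(p) for p in populations]
variances = [pvariance(p) for p in populations]

mu_s, var_s = silence_model_method4(means, variances, a=1.0, b=1.0)
pooled = [x for p in populations for x in p]
```

With a = b = 1 the result agrees with the mean and population variance
of `pooled`, which is the content of eqs. 30 through 34.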
Next, the operation of the speech recognizing device is explained.

The voice data collected by the microphone 1 (the spoken voice that
is the object of recognition, together with the environment noise) is
input to the framing division 2, where it is partitioned into frames.
The voice data of the respective frames are sequentially supplied, as
the observation vector a, to the noise-observation-section extracting
division 3 and the feature extracting division 5. In the
noise-observation-section extracting division 3, the voice data
(environment noise) of the noise observation section Tn preceding the
instant t_b at which the press-to-talk switch 4 was turned to the ON
position is extracted and supplied to the feature extracting division
5 and the silence-acoustic-model correcting division 7.
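The framing step and the extraction of the noise observation section
can be sketched as follows. This is an illustration under several
assumptions that this passage does not state: frames are taken to be
consecutive and non-overlapping, the instant at which the
press-to-talk switch is turned ON is given as a frame index, and both
function names are hypothetical.

```python
def split_into_frames(samples, frame_len):
    """Partition samples into consecutive, non-overlapping frames.

    A trailing partial frame is dropped (an assumption of this sketch).
    """
    n_frames = len(samples) // frame_len
    return [samples[k * frame_len:(k + 1) * frame_len] for k in range(n_frames)]


def noise_observation_section(frames, tb_frame, n_noise_frames):
    """Return the frames of Tn immediately preceding frame index tb_frame."""
    start = max(0, tb_frame - n_noise_frames)
    return frames[start:tb_frame]


samples = list(range(100))                 # stand-in for sampled voice data
frames = split_into_frames(samples, 10)    # 10 frames of 10 samples each
noise = noise_observation_section(frames, tb_frame=6, n_noise_frames=2)
```

Here `noise` holds the two frames just before the switch instant;
these would be the frames passed on for correcting the silence
acoustic model.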

In the feature extracting division 5, sound analysis is performed on
the voice data delivered from the framing division 2 as the
observation vector a, so as to find its feature vector y. In
addition, the feature extracting division 5 calculates, on the basis
of the found feature vector y, the feature distribution parameter Z
that represents the distribution in the feature vector space, and
supplies it to the speech recognizing division 6. The speech
recognizing division 6 computes the values of the discriminant
functions of the acoustic models corresponding to the silent section
and to each of K words (K being a predetermined number), using the
feature distribution parameter given from the feature extracting
division 5, and outputs the acoustic model whose function value is
maximum as the result of the speech recognition. In addition, the
speech recognizing division 6 updates the discriminant function
corresponding to the silence acoustic model that it has stored until
that time, using the discriminant function corresponding to the
silence acoustic model input from the silence-acoustic-model
correcting division 7.
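The decision rule and the update of the silence model can be
illustrated with a toy sketch. It does not reproduce the discriminant
functions of this description: a one-dimensional Gaussian
log-likelihood stands in for each discriminant function, a single
number stands in for the feature distribution parameter Z, and all
names and values are hypothetical.

```python
import math


def gaussian_log_likelihood(mean, var):
    """Stand-in discriminant function: log-density of a 1-D normal."""
    def g(z):
        return -0.5 * (math.log(2.0 * math.pi * var) + (z - mean) ** 2 / var)
    return g


def recognize(discriminants, z):
    """Return the label whose discriminant function value is maximum."""
    return max(discriminants, key=lambda label: discriminants[label](z))


models = {
    "silence": gaussian_log_likelihood(0.0, 1.0),
    "word_1": gaussian_log_likelihood(5.0, 1.0),
    "word_2": gaussian_log_likelihood(10.0, 1.0),
}
before = recognize(models, 2.0)

# Replace the stored silence discriminant with a corrected one, as the
# silence-acoustic-model correcting division would supply (made-up values).
models["silence"] = gaussian_log_likelihood(-3.0, 1.0)
after = recognize(models, 2.0)
```

In this toy setting the same input is labeled as silence before the
update and as a word after it, which shows how correcting the silence
model alone can change the recognition result.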

As stated above, the voice data serving as the observation vector a
is converted into a feature distribution parameter Z that represents
the distribution in the feature vector space, that is, the space of
the feature quantity, so the feature distribution parameter takes
into account the distribution characteristic of the noise included in
the voice data. In addition, the discriminant function corresponding
to the silence acoustic model for recognizing the silent section is
updated on the basis of the voice data of the noise observation
section Tn just preceding the speech. The correct recognition rate of
speech recognition can therefore be improved.

Next, Fig. 8 illustrates the results of an experiment that measured
how the correct speech recognition rate varies when the silent
section Ts, which is the interval from the turning ON of the
press-to-talk switch until the start of the speech, is varied.

In Fig. 8, the curve denoted "a" shows the result based on the
conventional method, wherein the silence acoustic model is not
corrected; the curve denoted "b" shows the result based on the first
method; the curve denoted "c" shows the result based on the second
method; the curve denoted "d" shows the result based on the third
method; and the curve denoted "e" shows the result based on the
fourth method.

The conditions of the experiment are as follows. The voice data to
be recognized were collected inside a car running on a highway. The
noise observation section Tn is about 0.2 second, which corresponds
to 20 frames. The silent section Ts was set to 0.05 second, 0.1
second, 0.2 second, 0.3 second, and 0.5 second. In the feature
extraction of the voice data, analysis was performed in the MFCC
(Mel-Frequency Cepstral Coefficients) domain. The speakers of the
voice that is the object of recognition were 8 persons, 4 men and 4
women, and each person spoke 303 words in a discrete manner. The
task is a 5000-word, large-vocabulary discrete Japanese task. The
acoustic models are based on HMM and were trained in advance using
satisfactory voice data. In the speech recognition, the beam width
of the Viterbi search was set to 3000.

In the first, second, and fourth methods, the coefficient a was set
to 1.0 and the coefficient b was set to 0.1. In the third method,
the coefficients a and b were both set to 1.0.

As can be seen from Fig. 8, with the conventional method (curve a)
the correct speech-recognition rate falls markedly as the silent
section Ts becomes longer, whereas with the first through fourth
methods of the present invention (curves b through e) it falls only
slightly even as the silent section Ts becomes longer.

That is, according to the present invention, the correct
speech-recognition rate can be maintained at a certain level even if
the silent section Ts varies.

A speech recognizing device to which the present invention is applied
has been explained above; such a speech recognizing device can be
applied to a wide variety of devices, such as a voice-entry car
navigation device.

In this embodiment, the feature distribution parameter is found
taking into account the distribution characteristic of the noise.
This noise may include, in addition to the external noise of the
environment in which the speech is uttered, other factors such as the
characteristics of the communication channel when speech recognition
is performed on voice transmitted via a communication channel, for
instance a telephone line.

The present invention is also applicable to pattern recognition other
than speech recognition, such as picture recognition.

The computer program for executing each of the processes described
above can be supplied to a user via a network providing medium such
as the Internet or a digital satellite, as well as via a providing
medium comprised of an information recording medium such as a
magnetic disk or a CD-ROM.



Industrial Applicability 

The present invention is applicable to a speech recognizing device. 